RStudio Exercise 4: Clustering and classification

Housekeeping

First, install required packages:

install.packages('plotly', repos = "http://cloud.r-project.org")
install.packages('dplyr', repos = "http://cloud.r-project.org")
install.packages('ggplot2', repos = "http://cloud.r-project.org")
install.packages('GGally', repos = "http://cloud.r-project.org")
install.packages("corrplot", repos = "http://cloud.r-project.org")

Access the required libraries:

library(dplyr)
library(MASS)
library(ggplot2)
library(GGally)

Summary of the data

For this exercise, we use the Boston dataset, which contains information on housing in Boston, Massachusetts. For more information, see the dataset's documentation in the MASS package (`?Boston`).

We start out by loading the data from the MASS package.

# load the data
data("Boston")

Explore the dataset Boston:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The dataset contains 506 observations of 14 variables:

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. BLACK - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000’s
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

We can see that the variable chas is a dummy variable (either 0 or 1), while the other variables are continuous with widely varying means and variances.
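The varying scales can be checked directly; a minimal sketch computing the column means and standard deviations:

```r
# compute the mean and standard deviation of every variable; the wide
# range of scales motivates standardizing before LDA and k-means
library(MASS)
data("Boston")
col_stats <- round(rbind(mean = sapply(Boston, mean),
                         sd   = sapply(Boston, sd)), 2)
col_stats[, c("crim", "nox", "tax")]
```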

We can see this more clearly by visualizing the data.

Visualize the data

# use ggplot to explore the data
#ggplot(Boston, aes(x=zn)) + geom_histogram(binwidth=50)

ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))

We can also plot the correlation between variables:

library(corrplot)
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

The crime rate (crim) correlates most strongly with accessibility to radial highways (rad, r ≈ 0.63) and the property-tax rate (tax, r ≈ 0.58); its correlation with the pupil-teacher ratio (ptratio) is weaker (r ≈ 0.29).
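We can verify these pairwise correlations directly (a quick sketch):

```r
# correlations of the crime rate with selected variables
library(MASS)
data("Boston")
round(cor(Boston$crim, Boston$rad), 2)      # strong positive
round(cor(Boston$crim, Boston$tax), 2)      # strong positive
round(cor(Boston$crim, Boston$ptratio), 2)  # weaker
```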

Standardize the data for the linear discriminant analysis (LDA)

boston_scaled <- scale(Boston) # center and standardize variables

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Looking at the scaled data, we can see that each variable is now centered on zero, i.e. the mean of every variable is zero by construction, whereas the means varied in the original data.
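Under the hood, scale() subtracts each column's mean and divides by its standard deviation; a manual sketch for the crime rate:

```r
# manual standardization of one column, matching what scale() does
library(MASS)
data("Boston")
crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(crim_manual, as.numeric(scale(Boston)[, "crim"]))  # TRUE
```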

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

## Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). 
# Use the quantiles as the break points in the categorical variable.

# summary of the scaled crime rate
summary(boston_scaled$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
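To illustrate how cut() with quantile break points works, here is a toy sketch on a small vector:

```r
# quartiles as break points give four (roughly) equal-sized groups;
# include.lowest = TRUE keeps the minimum value in the first interval
x <- c(1, 3, 5, 7, 9, 11, 13, 15)
breaks <- quantile(x)
groups <- cut(x, breaks = breaks, include.lowest = TRUE,
              labels = c("low", "med_low", "med_high", "high"))
table(groups)  # two observations per group
```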
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
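Note that sample() draws a different split on every run, so the numbers below vary between runs; fixing the random seed first makes the split reproducible. A sketch (the seed value is arbitrary):

```r
set.seed(2017)  # arbitrary seed; any fixed value makes the split reproducible
n <- 506        # number of rows in the Boston data
ind <- sample(n, size = n * 0.8)  # size is truncated to 404
length(ind)  # 404 training rows
```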

LDA on the train set

First, we fit the linear discriminant analysis on the train set by using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.

# fit the LDA model
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2376238 0.2524752 0.2475248 
## 
## Group means:
##                   zn      indus        chas        nox          rm
## low       1.01669560 -0.9213572 -0.12375925 -0.8811159  0.37251231
## med_low  -0.08325717 -0.2943403 -0.02626030 -0.5776409 -0.09440747
## med_high -0.38467135  0.1608112  0.26805724  0.3442697  0.07616192
## high     -0.48724019  1.0149946 -0.03610305  1.0654765 -0.30072478
##                 age        dis        rad        tax     ptratio
## low      -0.9279974  0.9305303 -0.6806695 -0.7195178 -0.44093036
## med_low  -0.3488857  0.3748458 -0.5464108 -0.4954872 -0.07905908
## med_high  0.4142751 -0.3341579 -0.4132674 -0.3185538 -0.19682797
## high      0.8055068 -0.8497270  1.6596029  1.5294129  0.80577843
##               black       lstat        medv
## low       0.3767145 -0.74951161  0.45874010
## med_low   0.3157161 -0.16424098  0.03765972
## med_high  0.0753590  0.02274549  0.13639604
## high     -0.8743343  0.91646554 -0.70400490
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.11672740  0.717439041 -0.94743492
## indus    0.02136662 -0.242137748  0.51733807
## chas    -0.07871185 -0.026728267  0.01456195
## nox      0.26805631 -0.723924494 -1.54050530
## rm      -0.09867890 -0.096302305 -0.11996983
## age      0.26862233 -0.392784783 -0.08947827
## dis     -0.10572993 -0.334489876  0.24939160
## rad      3.26047480  1.028063848  0.12328271
## tax      0.14080661 -0.032409449  0.43082543
## ptratio  0.16947958 -0.054511632 -0.36276707
## black   -0.15500250  0.009488418  0.09620898
## lstat    0.24937837 -0.340035108  0.31465904
## medv     0.22740349 -0.480955322 -0.17422203
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9513 0.0380 0.0106

We can see that the first linear discriminant (LD1) explains most of the differences between the groups, as its proportion of the trace is very large (0.9513).

We can see this more clearly by plotting the results:

# create the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The first linear discriminant (LD1) separates the high-crime areas, but cannot distinguish between the lower-crime neighborhoods. The second linear discriminant (LD2) explains some of the differences among the lower-crime neighborhoods, with lower values of LD2 associated with higher crime levels.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

summary(test)
##        zn              indus               chas         
##  Min.   :-0.4872   Min.   :-1.55630   Min.   :-0.27233  
##  1st Qu.:-0.4872   1st Qu.:-0.79322   1st Qu.:-0.27233  
##  Median :-0.4872   Median :-0.18028   Median :-0.27233  
##  Mean   :-0.1158   Mean   : 0.07861   Mean   :-0.07933  
##  3rd Qu.:-0.4872   3rd Qu.: 1.01499   3rd Qu.:-0.27233  
##  Max.   : 3.5861   Max.   : 2.42017   Max.   : 3.66477  
##       nox                 rm               age                dis         
##  Min.   :-1.33499   Min.   :-3.8764   Min.   :-2.33313   Min.   :-1.2446  
##  1st Qu.:-0.90350   1st Qu.:-0.7293   1st Qu.:-0.59771   1st Qu.:-0.8844  
##  Median :-0.09230   Median :-0.2151   Median : 0.43430   Median :-0.5102  
##  Mean   : 0.07048   Mean   :-0.0796   Mean   : 0.08876   Mean   :-0.1526  
##  3rd Qu.: 0.64339   3rd Qu.: 0.4830   3rd Qu.: 0.94676   3rd Qu.: 0.5353  
##  Max.   : 2.72965   Max.   : 3.4733   Max.   : 1.11639   Max.   : 2.0034  
##       rad                 tax             ptratio        
##  Min.   :-0.981871   Min.   :-1.3068   Min.   :-2.51994  
##  1st Qu.:-0.637331   1st Qu.:-0.7980   1st Qu.:-0.48756  
##  Median :-0.522484   Median :-0.2209   Median : 0.11292  
##  Mean   : 0.007837   Mean   : 0.0332   Mean   :-0.06052  
##  3rd Qu.: 1.659603   3rd Qu.: 1.5294   3rd Qu.: 0.80578  
##  Max.   : 1.659603   Max.   : 1.7964   Max.   : 1.26768  
##      black             lstat               medv         
##  Min.   :-3.8685   Min.   :-1.35597   Min.   :-1.90634  
##  1st Qu.: 0.2809   1st Qu.:-0.81088   1st Qu.:-0.63148  
##  Median : 0.3995   Median :-0.08095   Median :-0.10686  
##  Mean   : 0.0932   Mean   : 0.01224   Mean   : 0.04163  
##  3rd Qu.: 0.4406   3rd Qu.: 0.49775   3rd Qu.: 0.48300  
##  Max.   : 0.4406   Max.   : 3.09715   Max.   : 2.98650
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10       8        3    0
##   med_low    7      12       11    0
##   med_high   0       5       18    1
##   high       0       0        1   26

The results in the above table show that the model predicts high-crime areas well, but the low, medium-low and medium-high crime areas are harder to separate. In particular, the model struggles with the low category: it classifies true low-crime areas as med_low almost as often as it classifies them correctly.
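From the cross-tabulation we can also compute the overall accuracy: the diagonal holds the correct predictions. A sketch using the counts printed above (they vary between runs because the train/test split is random):

```r
# confusion matrix copied from the cross-tabulation above
# (rows = correct class, columns = predicted class)
conf <- matrix(c(10,  8,  3,  0,
                  7, 12, 11,  0,
                  0,  5, 18,  1,
                  0,  0,  1, 26),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 2)  # about 0.65
```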

K-means

#Reload the data from the MASS package
library(MASS)
data('Boston')

#Scale the data
boston_scaled <- scale(Boston)

Next, we calculate the pairwise distances between observations.

# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = 'manhattan')

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

We can see that the Euclidean distances are shorter than the Manhattan distances: the Euclidean distance is the straight-line distance between two observations, whereas the Manhattan distance is measured along the axes at right angles.
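The difference is easy to see with two points; a toy sketch:

```r
# two points: (0, 0) and (3, 4)
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # Manhattan: |3| + |4| = 7
```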

Next, we determine the optimal number of clusters for k-means by examining how the total within-cluster sum of squares changes with the number of clusters.

set.seed(123) # Fix random number generator

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

Because the total within-cluster sum of squares decreases fastest between one and two clusters, we perform k-means clustering with two centers.

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Then we plot the scaled dataset colored by cluster:

# plot the scaled Boston dataset with the clusters
pairs(boston_scaled, col = km$cluster)

# k-means clustering
km <- kmeans(boston_scaled, centers = 4)
summary(km)
##              Length Class  Mode   
## cluster      506    -none- numeric
## centers       56    -none- numeric
## totss          1    -none- numeric
## withinss       4    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           4    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
# plot the Boston dataset with clusters
pairs(boston_scaled[, 1:6], col = km$cluster)

# plot the Boston dataset with clusters
pairs(boston_scaled[, 7:14], col = km$cluster)

Bonus

Next, we perform linear discriminant analysis on the Boston data, using the clusters from the k-means clustering as the target variable and all other variables as predictors.

# Read data
data(Boston)
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
# Scale data
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)


# k-means with four centers
km1 <- kmeans(boston_scaled, centers = 4)

lda.fit1 <- lda(km1$cluster ~ ., data = boston_scaled)


# target classes as numeric
classes <- as.numeric(km1$cluster)

# plot the lda results
plot(lda.fit1, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit1, myscale = 1)

We can see that the linear discriminants separate the clusters well along the first two dimensions, because the clusters are quite distinct from one another.

Super Bonus

We use the plotly package to create a 3D image of the LDA projection:

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
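The matrix product projects each observation from the 13-dimensional predictor space onto the three linear discriminants, so the result has one row per observation and one column per discriminant. A sketch of the dimensions with stand-in random data:

```r
# (404 x 13) %*% (13 x 3) -> 404 x 3, one column per linear discriminant
set.seed(1)
predictors <- matrix(rnorm(404 * 13), nrow = 404)  # stand-in for the train predictors
scaling    <- matrix(rnorm(13 * 3),  nrow = 13)    # stand-in for lda.fit$scaling
dim(predictors %*% scaling)  # 404 3
```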